Choosing priors for cutpoints in an ordinal logistic model is tricky. Michael Betancourt developed an indirect prior on the cut points that may be a good choice:

Applying the Dirichlet prior model to the full ordinal model doesn’t regularize the internal cut points directly, but it does regularize them indirectly through its regularization of the ordinal probabilities.

I didn’t know about this option when I built an ordinal logistic model, and I want to compare it to what I actually used. When I modelled lodgepole pine flowering phenology as a function of forcing units, I used

  • a gamma distribution for the prior on the cut points, \(c\)
  • an exponential distribution for the transition speed/slope, \(\beta\)

I chose a gamma prior on the cutpoints because it is continuous and always positive, and I could center it where I wanted relatively easily. I struggled to get the model to fit or to provide consistent estimates, especially after adding groups.

Choosing priors

Normally, priors are bottom up - you choose a prior distribution for each parameter in your model. This is tricky for the cut points in an ordinal logistic model because they’re defined on an abstract latent space where your domain expertise is hard to apply. But a good prior really matters here, because ordinal logistic models with covariates are inherently non-identifiable (\(\beta\) and \(c\) trade off against one another). As Betancourt says -

To avoid the non-identifiability of the interior cut points from propagating to the posterior distribution we need a principled prior model that can exploit domain expertise to consistently regularize all of the internal cut points at the same time. Regularization of the cut points, however, is subtle given their ordering constraint. Moreover domain expertise is awkward to apply on the abstract latent space where the internal cut points are defined. (Betancourt)

The induced ordinal logistic prior on the cutpoints is a sort of top-down prior that works through the ordinal probabilities.

If I simulate data like mine using an ordinal logistic model, can I recapture the parameters using an ordinal logistic model with a gamma or induced dirichlet prior on the cut points?

Recapturing \(\beta\) and \(c\) well isn’t as important as recapturing the transition points, \(h\). These are the points at which half of the population has transitioned from one state to the next, calculated as \(h = \frac{c}{\beta}\).

Eventually I will want to consider how different groups affect \(h\), e.g. \(h = \frac{c + \alpha_{site_i} + \alpha_{provenance_i}}{\beta}\). Estimating \(\beta\), \(c\), and \(\alpha\) well will probably be more important then. But for now, let’s just consider an ordinal logistic model with a covariate/latent effect and no groups.

Goals

  1. When can I recapture parameters with a gamma prior?
  2. When can I recapture parameters with an induced dirichlet prior?
  3. Which one works best?

Simulate data

First let’s simulate some data that’s kind of like mine.

I have three potential states (\(K = 3\)) and a covariate \(x\). \(x\) is always positive, and measurements are only made when \(x\) is between about \(8\) and \(18\). Data collection begins when workers notice trees approaching or entering state \(2\) and stops as trees enter state \(3\).
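The data-generating process here is the standard ordered-logistic one: with latent effect \(\eta = \beta x\) and cut points \(c\), each state’s probability is a difference of logistic CDFs. Here’s a quick Python sketch of those probabilities (the function names are mine, not from the Stan programs below):

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def ordered_logistic_probs(eta, c):
    """State probabilities under the ordered-logistic parameterization:
    P(y = 1) = 1 - inv_logit(eta - c[0]),
    P(y = k) = inv_logit(eta - c[k-2]) - inv_logit(eta - c[k-1]),
    P(y = K) = inv_logit(eta - c[K-2])."""
    sig = [inv_logit(eta - ck) for ck in c]
    probs = [1.0 - sig[0]]
    probs += [sig[k - 1] - sig[k] for k in range(1, len(c))]
    probs.append(sig[-1])
    return probs

# Medium dataset: beta = 1, cut points at 10 and 15.
for x in (8.0, 12.5, 18.0):
    print(x, [round(p, 3) for p in ordered_logistic_probs(1.0 * x, [10.0, 15.0])])
```

At \(x\) between the transition points most of the probability mass sits on state 2; below \(10\) it shifts to state 1 and above \(15\) to state 3.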

I simulate 3 datasets with

  • slow (\(\beta = 0.5\)),
  • medium (\(\beta = 1\)), and
  • fast (\(\beta = 2\)) transition speeds.

I want the transitions to occur at the same \(x\) for all three datasets. The transition points \(h\) will be at \(x = 10\) and \(x = 15\). \(h\) is calculated from the cut points \(c\) and transition speed \(\beta\) as \(h = \frac{c}{\beta}\). So while the transition points stay the same in each simulated dataset, the cut points will differ.

Cut points will be at 5 and 7.5 for the slow transition dataset, at 10 and 15 for the medium, and at 20 and 30 for the fast transition.
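A tiny Python check of that arithmetic, computing \(c = \beta h\) for each dataset:

```python
# Cut points implied by fixed transition points h = (10, 15): c_k = beta * h_k.
h = (10.0, 15.0)
cutpoints = {beta: tuple(beta * hk for hk in h) for beta in (0.5, 1.0, 2.0)}
print(cutpoints)  # {0.5: (5.0, 7.5), 1.0: (10.0, 15.0), 2.0: (20.0, 30.0)}
```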

Here is what the first transition between states looks like for those transition speeds. My data is most like the fast transition - a phenological period with \(\beta = 0.5\) transitions would occur over an unrealistically long time period (like January to June instead of a 2 week period in late spring).
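One way to quantify “how fast”: the width of the \(x\) interval over which the probability of having passed a cut point rises from 10% to 90% is \((\operatorname{logit}(0.9) - \operatorname{logit}(0.1)) / \beta \approx 4.4 / \beta\), regardless of where the cut point sits. A quick Python sketch:

```python
import math

def transition_width(beta, lo=0.1, hi=0.9):
    """x-distance over which P(past a cut point) = inv_logit(beta*x - c)
    rises from lo to hi; independent of the cut point c itself."""
    logit = lambda p: math.log(p / (1.0 - p))
    return (logit(hi) - logit(lo)) / beta

for beta in (0.5, 1.0, 2.0):
    print(beta, round(transition_width(beta), 2))
```

So \(\beta = 0.5\) spreads a transition over about 8.8 units of \(x\), while \(\beta = 2\) finishes in about 2.2.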

I simulated data with the Stan program

## // Simulate ordinal logistic data with a covariate
## 
## data {
##   int<lower=1> N;
##   int<lower=2> K;
## 
##   vector[N] x; //covariate
##   positive_ordered[K-1] c; //cutpoints
##   real beta; //slope
## }
## 
## generated quantities {
##   int<lower=1, upper=K> y[N];                     // Simulated ordinals
##   vector[N] gamma;
## 
##   for (n in 1:N) {
##     gamma[n] = x[n] * beta; // Latent effect
##     y[n] = ordered_logistic_rng(gamma[n], c); // ordinals
##   }
## }
## 
## SAMPLING FOR MODEL 'covar_sim' NOW (CHAIN 1).
## Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0 seconds (Warm-up)
## Chain 1:                0.000139 seconds (Sampling)
## Chain 1:                0.000139 seconds (Total)
## Chain 1: 

Recapture parameters with gamma

Now I want to try to recapture parameters.

I need to know how carefully I have to choose the shape and rate parameters of the gamma prior on the cut points in order to recapture the “true” slope and cut point parameters. (The mean of the gamma is the shape divided by the rate.)

Choosing where to center the gamma isn’t obvious since cut points are on that “abstract latent space.” Since \(c = \beta h\), though, we have some idea of the range of possible values.

If we center a gamma distribution around the middle of state \(2\) (\(x = 12.5\)), how much wiggle room does it need to find the cut points? When \(\beta = 0.5\), \(x = 12.5\) maps to \(6.25\) on the latent cut point scale, and when \(\beta = 2\) it maps to \(25\). The larger \(\beta\) is, the bigger and further apart the cut points are; the smaller \(\beta\) is, the smaller and closer together they are. Since we don’t know \(\beta\) in advance, we probably have to choose a fat gamma so we can cover the full range of values where the cut points could be. On the other hand, we might end up estimating combinations of \(c\) and \(\beta\) that get \(h\) right but get both \(c\) and \(\beta\) very wrong.

(The more categories you have, the harder this will be).

I also want to understand how the exponential prior on \(\beta\) affects the model’s ability to recapture parameters. I’ll try rates (\(\lambda\)) of 1, 2, and 3.
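Those rates imply quite different prior beliefs about fast transitions. Under an exponential(\(\lambda\)) prior, \(P(\beta > b) = e^{-\lambda b}\); here’s the prior mass above the fastest true \(\beta\) of 2:

```python
import math

def tail_prob(lam, b=2.0):
    """P(beta > b) under an exponential(lam) prior."""
    return math.exp(-lam * b)

for lam in (1, 2, 3):
    print(lam, round(tail_prob(lam), 4))
```

\(\lambda = 1\) leaves about 13.5% prior mass above \(\beta = 2\); \(\lambda = 3\) leaves only about 0.25%.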

So there are three simulated datasets - one for each of three transition speeds. I’m going to fit each of them with three rates for \(\beta\)’s exponential prior and three shape and rate combinations for the cut points’ gamma prior, a total of 27 models. I’m going to do 10 runs of each model in Stan.

Prior parameters - gamma prior models. modelids are unique identifiers for each combination of prior parameters and simulated datasets, in order from fast transition to slow transition dataset
modelids gammaid shape cut_rate beta_rate
1, 10, 19 fat 1.250 0.10 1
2, 11, 20 middling 3.125 0.25 1
3, 12, 21 skinny 12.500 1.00 1
4, 13, 22 fat 1.250 0.10 2
5, 14, 23 middling 3.125 0.25 2
6, 15, 24 skinny 12.500 1.00 2
7, 16, 25 fat 1.250 0.10 3
8, 17, 26 middling 3.125 0.25 3
9, 18, 27 skinny 12.500 1.00 3
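One thing the table obscures: all three gamma priors share the same mean, \(\text{shape}/\text{rate} = 12.5\) (midway between the \(\beta = 1\) cut points), and differ only in spread, \(\text{sd} = \sqrt{\text{shape}}/\text{rate}\). A quick Python check:

```python
import math

# (shape, rate) for the three gamma priors in the table above.
priors = {"fat": (1.25, 0.10), "middling": (3.125, 0.25), "skinny": (12.5, 1.0)}
for name, (shape, rate) in priors.items():
    print(name, round(shape / rate, 2), round(math.sqrt(shape) / rate, 2))
```

The fat prior has an sd of about 11.2 and the skinny one about 3.5, all centered at 12.5.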

The Stan program implementing these models is

## 
## data {
##     int<lower=1> N;             // Number of observations
##     int<lower=1> K;             // Number of ordinal categories
## 
##     int<lower=1, upper=K> y[N]; // Observed ordinals
##     vector[N] x;                  // Covariate
##     real<lower=1> shape; // shape parameter for gamma prior on cutpoints
##     real<lower=0> beta_rate; // rate parameter for exponential prior on beta
##     real<lower=0> cut_rate; // rate parameter for gamma prior on cutpoints
## }
## 
## parameters {
##     positive_ordered[K - 1] c; // (Internal) cut points
##     real<lower=0> beta; // population level effect
## }
## 
## model {
##     vector[N] phi;
## 
##     // Prior model
##     beta ~ exponential(beta_rate);
##     c ~ gamma(shape, cut_rate);
## 
## 
##     // Observational model
## 
##   for (i in 1:N ) {
##     phi[i] = beta * x[i];
##     y[i] ~ ordered_logistic(phi[i], c);
##     }
## 
## }
## 
## generated quantities {
##     //vector[N] gamma_ppc;
##     real<lower=0> h1;
##     real<lower=h1> h2;
##     //int<lower=1, upper=K> y_ppc[N];
## 
##     h1 = c[1]/beta;
##     h2 = c[2]/beta;
## 
##     // for (n in 1:N) {
##     //     gamma_ppc[n] = beta*x[n];
##     //     y_ppc[n] = ordered_logistic_rng(gamma_ppc[n], c);
##     // }
## }

Model check

Here’s a very cursory check for problems after fitting the models - divergences and issues with rhats and neff_ratios.

Models with divergences, rhats > 1, or neff_ratios < 0.1 - gamma
modelid bad_proportion bad_count divergences_mean bad_rhats_mean bad_neff_mean
1 0.3 3 0 2.333333 0.3333333
12 0.1 1 0 3.000000 0.0000000
18 0.2 2 0 2.000000 0.0000000
2 0.1 1 0 3.000000 0.0000000
20 0.1 1 0 1.000000 0.0000000
3 0.1 1 0 2.000000 0.0000000
4 0.1 1 0 1.000000 1.0000000
6 0.2 2 0 3.000000 0.0000000
7 0.2 2 0 3.000000 0.0000000
8 0.1 1 0 3.000000 0.0000000
9 0.4 4 0 2.000000 0.2500000

11 out of 27 models had problems with Rhat or neff_ratio in some of the 10 model runs. That’s not ideal, though there were never any divergences.

model parameterization-dataset combinations that never had a (very obvious) problem
gammaid shape cut_rate beta_rate transition beta c.1 c.2 h1 h2 modelid
middling 3.125 0.25 2 fast 2.0 20 30.0 10 15 5
fat 1.250 0.10 1 medium 1.0 10 15.0 10 15 10
fat 1.250 0.10 2 medium 1.0 10 15.0 10 15 13
fat 1.250 0.10 3 medium 1.0 10 15.0 10 15 16
middling 3.125 0.25 1 medium 1.0 10 15.0 10 15 11
middling 3.125 0.25 2 medium 1.0 10 15.0 10 15 14
middling 3.125 0.25 3 medium 1.0 10 15.0 10 15 17
skinny 12.500 1.00 2 medium 1.0 10 15.0 10 15 15
fat 1.250 0.10 1 slow 0.5 5 7.5 10 15 19
fat 1.250 0.10 2 slow 0.5 5 7.5 10 15 22
fat 1.250 0.10 3 slow 0.5 5 7.5 10 15 25
middling 3.125 0.25 2 slow 0.5 5 7.5 10 15 23
middling 3.125 0.25 3 slow 0.5 5 7.5 10 15 26
skinny 12.500 1.00 1 slow 0.5 5 7.5 10 15 21
skinny 12.500 1.00 2 slow 0.5 5 7.5 10 15 24
skinny 12.500 1.00 3 slow 0.5 5 7.5 10 15 27

Models with slow transition speeds were less likely to have problems. Slow transitions with “loose” priors on \(\beta\) run into trouble fitting - except when the cutpoints are highly constrained near the real values by a skinny gamma.

Parameter estimates

Parameter estimates for each run of each model are plotted below.

As you can see, not all runs of the same model on the same data produce the same results.

Plots of parameter estimates from models with different values for the prior. Dotted lines show models with poor model diagnostics - Rhat too big, Neff too small, etc.

Some model runs do not recapture parameters even when the (admittedly naive) diagnostics look good. In some cases, it looks like the model is exploring two “sides” of the parameter space and missing the middle (see the cutpoint 1 graph for the slow transition dataset with a fat gamma and \(\lambda = 1\)).

While fast transition datasets lead to wider estimates of \(\beta\) and \(c\), they have narrower estimates of \(h\).

Recapture rate

Of the 5 parameters (\(\beta\), \(c\) and derived \(h\)), how many are recaptured by the model?

Most models recapture all parameters in the 90% HPDI, but a skinny gamma prior makes recapture harder, especially when \(\beta\) deviates from 1.

Parameters are recaptured better when transition speed is fast, as long as the cutpoints prior isn’t too skinny.

Which parameters are easier or harder to recapture?

\(\beta\) and \(c\) are never recaptured in the 50% HPDI when the real \(\beta\) is slow, but they are usually recaptured in the 90% HPDI, except when the gamma prior on cutpoints is skinny.

A skinny gamma prior centered in between the transition points makes it difficult to recapture \(\beta\) and \(c\) across the board, unless \(\beta\) is near 1 (See “medium” models).

The choice of \(\lambda\) (the rate parameter for the exponential prior on \(\beta\)) doesn’t seem to matter very much, but when transition speed is fast, a too loose (i.e. small) \(\lambda\) makes it harder to recapture \(\beta\) and \(c\). When transitions are slow, a too tight (i.e. large) \(\lambda\) might make estimates worse.

Half transition points \(h\)

Both \(h\) are always captured in the 90% HPDI, even when \(\beta\) and \(c\) are not recaptured. So the ratio between \(\beta\) and \(c\) is being captured, even when the actual values of those parameters aren’t.
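This is the non-identifiability from earlier in action: jointly rescaling \(c\) and \(\beta\) leaves \(h = c / \beta\) untouched, so a fit can land on the right ratio with both components wrong by the same factor. A minimal sketch:

```python
# True values for the medium dataset.
c, beta = [10.0, 15.0], 1.0
for scale in (0.5, 1.0, 2.0):
    h = [ck * scale / (beta * scale) for ck in c]
    print(scale, h)  # h is [10.0, 15.0] at every scale
```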

\(h_2\) is estimated best (i.e. in the 50% HPDI) when transitions are fast, and \(h_1\) is estimated best when transitions are slow. This suggests good estimates for \(\beta\) and \(c\) are correlated with worse estimates for \(h\) - a concerning tradeoff.

Best prior parameter choices

Fat to middling gamma shape. Beta rate (\(\lambda\)) doesn’t matter that much, but 2 works best. So the best prior parameter choices in this situation are a moderately loose \(\lambda\) of 2 and a fat to middling gamma distribution.

Caveats

I only tried centering the gamma prior on the cutpoints in between the two \(h\). I could also try centering it closer to one transition or the other, or changing the skew.

While models run on the slow transition dataset were less likely to have problems with Rhat, etc. (given the default options of the Stan call), it was also harder to estimate their parameters well. That’s not ideal. But maybe the Rhat and neff problems can be fixed? I didn’t investigate very thoroughly what was going wrong.

Recapture parameters with induced dirichlet

Let’s repeat this analysis using an induced dirichlet prior on cutpoints. I’ll use the same simulated data as above.

I’ll test the same \(\lambda\) parameters as above. I’ll also vary the anchor parameter in the induced dirichlet prior.

In his example, Betancourt sets the anchor to 0. I don’t completely understand how the anchor works, so I’ll try 5 and 10 as well and see what happens.
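To build some intuition first: in the induced_dirichlet_lpdf below, the cut points are turned into ordinal probabilities evaluated at the latent point \(\phi\) - the anchor. A Python sketch of just that probability step (ignoring the Jacobian), showing how cut points far from the anchor push the induced probabilities into a corner of the simplex:

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def induced_probs(c, anchor):
    """Ordinal probabilities induced by cut points c at the anchor point,
    mirroring the sigma/p calculation in induced_dirichlet_lpdf."""
    sig = [inv_logit(anchor - ck) for ck in c]
    p = [1.0 - sig[0]]
    p += [sig[k - 1] - sig[k] for k in range(1, len(c))]
    p.append(sig[-1])
    return p

# Medium dataset cut points (10, 15) seen from two anchors.
print(induced_probs([10.0, 15.0], 0.0))   # nearly all mass on state 1
print(induced_probs([10.0, 15.0], 12.5))  # mass spread across states
```

An anchor of 0 treats latent value 0 as the prior’s “typical observation,” where cut points at 10 and 15 imply an extreme simplex; an anchor near the cut points implies a much more even one.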

I’m going to fit all three simulated datasets with each of the 3 rates for the prior on \(\beta\) and each of the 3 anchor parameters for the prior on the cutpoints - again, a total of 27 models.

Prior parameters - induced dirichlet prior models. modelids are unique identifiers for each combination of prior parameters and simulated datasets, in order from fast transition to slow transition dataset
modelids anchor beta_rate
1, 2, 3 0 1
4, 5, 6 0 2
7, 8, 9 0 3
10, 11, 12 5 1
13, 14, 15 5 2
16, 17, 18 5 3
19, 20, 21 10 1
22, 23, 24 10 2
25, 26, 27 10 3

The Stan program implementing these models is

## //Betancourt's ordinal logistic with a dirichlet prior
## functions {
##     real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
##         int K = num_elements(c) + 1;
##         vector[K - 1] sigma = inv_logit(phi - c);
##         vector[K] p;
##         matrix[K, K] J = rep_matrix(0, K, K);
## 
##         // Induced ordinal probabilities
##         p[1] = 1 - sigma[1];
##         for (k in 2:(K - 1))
##             p[k] = sigma[k - 1] - sigma[k];
##         p[K] = sigma[K - 1];
## 
##         // Baseline column of Jacobian
##         for (k in 1:K) J[k, 1] = 1;
## 
##         // Diagonal entries of Jacobian
##         for (k in 2:K) {
##             real rho = sigma[k - 1] * (1 - sigma[k - 1]);
##             J[k, k] = - rho;
##             J[k - 1, k] = rho;
##         }
## 
##         return   dirichlet_lpdf(p | alpha)
##         + log_determinant(J);
##     }
## }
## 
## data {
##     int<lower=1> N;             // Number of observations
##     int<lower=1> K;             // Number of ordinal categories
##     int<lower=1, upper=K> y[N]; // Observed ordinals
##     vector[N] x;                  // Covariate
##     real<lower=0> beta_rate;           // beta rate parameter
##     real<lower=0> anchor;         // anchor parameter for induced dirichlet
## }
## 
## parameters {
##     //real gamma;       // Latent effect
##     positive_ordered[K - 1] c; // (Internal) cut points
##     real<lower=0> beta; // population level effect (bounded to match the exponential prior)
## }
## 
## model {
##     vector[N] gamma;
## 
##     // Prior model
##     beta ~ exponential(beta_rate);
##     c ~ induced_dirichlet(rep_vector(1, K), anchor);
## 
##     // Observational model
## 
##   for (i in 1:N ) {
##     gamma[i] = beta*x[i];
##     y[i] ~ ordered_logistic(gamma[i], c);
##     }
## }
## 
## generated quantities {
##     //vector[N] gamma_ppc;
##     real h1;
##     real h2;
##     //int<lower=1, upper=K> y_ppc[N];
## 
##     h1 = c[1]/beta;
##     h2 = c[2]/beta;
## 
##     // for (n in 1:N) {
##     //     gamma_ppc[n] = beta*x[n];
##     //     y_ppc[n] = ordered_logistic_rng(gamma_ppc[n], c);
##     // }
## }

Model check

Here’s another dangerously cursory check for model problems - divergences and issues with rhats and neff_ratios.

Models with divergences, rhats > 1, or neff_ratios < 0.1 - induced dirichlet
modelid bad_proportion bad_count divergences_mean bad_rhats_mean bad_neff_mean
1 0.1 1 0 3 0
13 0.2 2 0 2 0
17 0.1 1 0 2 0
19 0.1 1 0 2 0
22 0.1 1 0 3 0
23 0.1 1 0 3 0
4 0.1 1 0 3 3
7 0.1 1 0 3 0
8 0.1 1 0 3 0

9 out of 27 models had problems with Rhat or neff_ratio in one or two of the 10 model runs. Only model 4 had neff_ratio problems. Again, no divergences.

model parameterization-dataset combinations that never had a (very obvious) problem
transition beta c.1 c.2 beta_rate anchor h1 h2 modelid
medium 1.0 10 15.0 1 0 10 15 2
slow 0.5 5 7.5 1 0 10 15 3
medium 1.0 10 15.0 2 0 10 15 5
slow 0.5 5 7.5 2 0 10 15 6
slow 0.5 5 7.5 3 0 10 15 9
fast 2.0 20 30.0 1 5 10 15 10
medium 1.0 10 15.0 1 5 10 15 11
slow 0.5 5 7.5 1 5 10 15 12
medium 1.0 10 15.0 2 5 10 15 14
slow 0.5 5 7.5 2 5 10 15 15
fast 2.0 20 30.0 3 5 10 15 16
slow 0.5 5 7.5 3 5 10 15 18
medium 1.0 10 15.0 1 10 10 15 20
slow 0.5 5 7.5 1 10 10 15 21
slow 0.5 5 7.5 2 10 10 15 24
fast 2.0 20 30.0 3 10 10 15 25
medium 1.0 10 15.0 3 10 10 15 26
slow 0.5 5 7.5 3 10 10 15 27

Slow transition models are least likely to have obvious problems.

The models have pretty consistent estimates across runs, even when there are problems with model diagnostics, but they are pretty consistently wrong - parameters are usually underestimated. Cutpoints are quite sensitive to the anchor. \(h_1\) is estimated much better than \(h_2\), which is always underestimated.

Recapture rate

Of the 5 parameters (\(\beta\), \(c\) and derived \(h\)), how many are recaptured by the model?

Using an induced dirichlet prior on the cutpoints doesn’t seem to work very well. Most parameters just aren’t recaptured most of the time.

Which parameters are easier or harder to recapture?

When transitions are fast, no choice of anchor allows \(\beta\) and \(c\) to be recaptured, but the half transition points \(h\) are (though \(h_2\) is only ever captured in the 90% HPDI, never the 50%). A larger anchor might help, since the cutpoints here are 20 and 30.

When transitions are medium speed, a large anchor allows all parameters to be recaptured in the 90% HPDI. Any anchor allows \(h\) to be recaptured in the 90%, but \(h_2\), again, is never captured in the 50%. Cutpoints are 10 and 15 here. A slightly larger anchor may help here as well.

When transitions are slow, \(h_2\) is never recaptured, but all other parameters are recaptured in the 90% HPDI. \(\beta\) is recaptured better when the anchor is lower, \(c_1\) is captured better when the anchor is 5, and \(h_1\) is captured better when the anchor is large (10).

Interpretation

\(c_1\) is 5, 10, and 20 for the slow, medium, and fast transition datasets, respectively. Models were tried with anchors at 0, 5, and 10. I think the fast transition dataset is so hard to fit because an anchor of 10 is too far away from a \(c_1\) of 20. The model does recapture the ratio \(c/\beta\): since \(c\) is underestimated, \(\beta\) ends up underestimated as well to preserve the ratio.

The induced dirichlet prior assumes the cutpoints are centered around the anchor. If I think reasonable ranges for my cutpoints are \(\beta h_1 = 1 \times 10\) to \(\beta h_2 = 2 \times 15\), then 20 might be a reasonable anchor. If I think the range is wider (\(\beta h_1 = 0.5 \times 10\) to \(\beta h_2 = 3 \times 15\)), I’m not sure what a reasonable anchor would be. When \(\beta = 0.5\), the model never recaptured \(h_2\). The anchor might struggle to pull the cutpoints small enough - and keep the distance between them small enough - if \(\beta\) is small and the anchor is large.

Best prior parameter choices

I’m not sure! I think an anchor of 20 to 25 is probably reasonable given what I know about the system.

Gamma vs induced dirichlet prior on the cutpoints

I might be fighting more fit problems in Stan with a gamma prior than with an induced dirichlet. Slow transition datasets fit with dirichlet-prior models are both less likely to have obvious fit problems (like sad rhats) and more likely to return the correct parameter values. That’s better than the situation with the gamma prior, where slow transition datasets were also less likely to have obvious fit problems but estimated parameters badly when they did fit.

On the other hand, my data doesn’t have a slow transition - but I could make it have a slow transition by re-scaling \(x\).

This model seems fussy, and I’m concerned about how it will behave when I increase the complexity from \(\beta x\) to something like \(\beta x + \alpha_{site} + \alpha_{prov} + \alpha_{clone}\).

tl;dr

Ordinal logistic models with gamma priors have fit issues, and ordinal logistic models with induced dirichlet priors can’t recapture the parameters. If the fit/Stan diagnostic issues can be dealt with, the gamma prior may be workable. But things will likely explode when adding groups.